12-cs01-eda

Professor Shannon Ellis

11/7/23

CS01: Biomarkers of Recent Use (EDA)

Q&A

Q: was curious about the sensitivity/cut-offs/specificity, but we mainly discussed it in class already; was still slightly confused about it?
A: We’ll discuss this in detail later this week and early next week!

Q: Are Youden’s indices related to ROC curves?
A: Related, yes! We’ll discuss both soon!

Q: I wasn’t sure how to interpret some of the visuals towards the end of lecture.
A: That’s OK! We’ll be recreating these and discussing them more as we do this case study in class.

Q: Did they actually use intravenous blood draws or just thumb pricks because multiple intravenous would not be fun.
A: It was venous blood from the arm. This unfunness is one of the reasons participants were compensated.

Q: Did this study (or other studies on THC) impairment end up influencing any legislation at the local or state level?
A: Great question! The state is currently reviewing these and other study’s data. The state was definitely aware of this study and waited (im)patiently while we analyzed and worked to publish.

Q: How long should our reports be?
A: It’s hard to say. We’ll discuss an example today so you have a sense!

Q: Regarding the final project, will there be a Google form that we can fill out that will help us form groups if we can’t form one ourselves?  A: Yup - I’d say try to find a group using Piazza, in class, or during lab. However, if you’re unable, when you fill out the form to indicate your group next week, you’ll select that you’d like to be placed into a group.

Course Announcements

Due Dates:

  • 🔬 Lab05 (MLR) due Friday
  • Mid-course survey “due” (for EC) Friday
  • 💻 HW03 (MLR) due Mon 11/20

Notes:

  • CS01 Groups have been sent out
    • email for contact
    • GitHub repo <- please accept and open; make sure you have access
    • group mate feedback is required
    • if you made changes to repo yesterday, be sure to pull to get data in your repo
  • Final Project: can use Piazza to help find group mates

Important

The CS01 data are data for you only. My collaborator is excited that y’all will be working on this…but these are still research data, so please do not share with others or post publicly.

Agenda

  • Previous Projects
  • Exploring the Data

Previous Case Studies

Example Case Study

See & Discuss: https://cogs137.github.io/website/content/cs/cs-example.html

Feedback & Scores

Feedback to other students here

Common comments:

  • context/explanation/guidance/lacking
  • missing citations
  • failure to introduce/describe the data
  • making statements without evidence
  • need to edit for cohesiveness, story, clarity

An (Example) Rubric

This is NOT the rubric for your case study, but it will be similar:

Notes

  • Lots of code/plots will be provided here
  • You are free to include any of it in your own case study (no attribution needed)
  • You probably should NOT include all of them in your final report
  • For any of the “basic” plots you include in your report, you’ll want to clean them up/improve their design.
  • Your final report should be polished from start to finish

Data & Files

Packages

Two additional packages required for these notes:

library(tidyverse)
library(purrr)
library(rstatix)

Our Datasets

Three matrices:

  • Blood (WB): 8 compounds; 190 participants
  • Oral Fluid (OF): 7 compounds; 192 participants
  • Breath (BR): 1 compound; 191 participants

Variables:

  • ID | participants identifier
  • Treatment | placebo, 5.90%, 13.40%
  • Group | Occasional user, Frequent user
  • Timepoint | indicator of which point in the timeline participant’s collection occurred
  • time.from.start | number of minutes from consumption
  • & measurements for individual compounds

The Data

Reading in the .RData we wrote at the end of the last set of notes…(using load)

load("data/compounds.RData")
load("data/timepoints.RData")
load("data/data_clean.RData")

This reads the objects stored in these files into your Environment for use.

What to do with all of these functions?

For example…we discussed this function in the last set of notes.

 drop_dups <- function(dataset){
  out <- dataset |> 
    filter(!is.na(timepoint_use)) |> 
    group_by(timepoint_use) |> 
    distinct(id, .keep_all=TRUE) |> 
    ungroup()
  return(out)
 } 

We’re going to have a lot of functions throughout…like this helper function to clean up names

# helper function to clean up name of two compounds
clean_gluc <- function(df){
  df <- df %>% 
    mutate(compound=gsub('GLUC', 'gluc',gsub("_","-",toupper(compound))),
           compound=gsub('THCOH', '11-OH-THC', compound))
  return(df)
}

Functions can/should be stored in a separate .R file, probably in a src/ directory.

To have access to the functions in that file…

source("path/to/file")
source("src/cs01_functions.R")

EDA

Single Variable (basic) plots

For a single compound…

ggplot(WB, aes(x=thc)) + geom_histogram()

But, with wide data, that’s not easy to do for all compounds, so you may want to pivot those data….

WB_long <- WB |> 
  pivot_longer(6:13) |>
  rename("fluid" = "fluid_type")

Distribtions across all compounds (WB):

ggplot(WB_long, aes(x=value)) + 
  geom_histogram() +
  facet_wrap(~name)

Now the same for OF and BR:

OF_long <- OF |> pivot_longer(6:12)
BR_long <- BR |> pivot_longer(6)

Combining long datasets:

df_full <- bind_rows(WB_long, OF_long, BR_long)

Plotting some of these data…

df_full |>
  mutate(group_compound = paste0(fluid,": ", name)) |>
ggplot(aes(x=value)) + 
  geom_histogram() + 
  facet_wrap(~group_compound, scales="free")

Two variables at a time

THC & Frequency

df_full |> 
  filter(name=="thc") |>
  ggplot(aes(x=group, y=value)) + 
  geom_boxplot() +
  facet_wrap(~fluid, scales="free")

THC & Treatment Group

df_full |> 
  filter(name=="thc") |>
  ggplot(aes(x=treatment, y=value)) + 
  geom_boxplot() +
  facet_wrap(~fluid, scales="free")

Focus on a specific timepoint…

df_full |> 
  filter(name=="thc", timepoint=="T2A") |>
  ggplot(aes(x=treatment, y=value)) + 
  geom_boxplot() +
  facet_wrap(~fluid, scales="free")

At this point…

We start to get a sense of the data with these quick and dirty plots, but we’re really only scratching the surface of what’s going on in these data.

These data require a lot of exploration due to the number of compounds, multiple matrices, and data over time aspects.

Compounds across time

compound_scatterplot_group <- function(dataset, compound, timepoints){
  if(max(dataset[,compound],na.rm=TRUE)==0){
    print(
      dataset %>% 
        filter(!is.na(time_from_start)) %>%
        ggplot(aes_string(x="time_from_start", 
                          y=compound,
                          color="group")) + 
        geom_point() +
        geom_vline(data=timepoints, aes(xintercept=as.numeric(stop)), 
                   linetype="dashed", 
                   color="gray28") +
        scale_color_manual(values=c("#19831C", "#A27FC9")) +
        scale_y_continuous(limits=c(0,3)) +
        theme_classic() +
        theme(legend.position="bottom",
              legend.title=element_blank()) +
        labs(x='Time From Start (min)',
             y=gsub('GLUC', 'gluc',gsub("_", "-", toupper(compound))))
    )}else{
      print(
        dataset %>% 
          filter(!is.na(time_from_start)) %>%
          ggplot(aes_string(x="time_from_start", 
                            y=compound,
                            color="group")) + 
          geom_point() +
          geom_vline(data=timepoints, aes(xintercept=as.numeric(stop)), 
                     linetype="dashed", 
                     color="gray28")  +
          scale_color_manual(values=c("#19831C", "#A27FC9")) +
          theme_classic() +
          theme(legend.position="bottom",
                legend.title=element_blank()) +
          labs(x='Time From Start (min)',
               y=gsub('GLUC', 'gluc', gsub("_", "-", toupper(compound))))
      )
    }
}
scatter_wb <- map(compounds_WB, ~ compound_scatterplot_group( 
    dataset=WB_dups, 
    compound=.x, 
    timepoints=timepoints_WB))

scatter_of <- map(compounds_OF, ~ compound_scatterplot_group( 
    dataset=OF_dups, 
    compound=.x, 
    timepoints=timepoints_OF))

scatter_br <- map(compounds_BR, ~ compound_scatterplot_group( 
    dataset=BR_dups, 
    compound=.x, 
    timepoints=timepoints_BR))

Pairplots

pairs(WB[,unlist(compounds_WB)], 
      pch=19, 
      cex=0.3, 
      cex.labels=0.6,
      labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(WB[,unlist(compounds_WB)])))))

pairs(OF[,unlist(compounds_OF)], 
      pch=19, 
      cex=0.4, 
      cex.labels=0.6,
      labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(OF[,unlist(compounds_OF)])))))

❓ Why is there no pairplot for Breath?

Group Differences: Frequency of Use

compound_boxplot_group_only <- function(dataset, compounds, tissue, legend=TRUE, y_lab=TRUE){
  timepoint_to_use = levels(dataset$timepoint_use)[1]
  df <- dataset %>% 
    filter(timepoint_use == timepoint_to_use) %>%
    select(group, all_of(compounds)) 
  df <- df %>% 
    gather(compound, value, -group) %>% 
    clean_gluc() %>% 
    group_by(compound) %>% 
    mutate(y_max = 1.2*max(value)) %>% 
    group_by(compound, group) %>% 
    mutate(n = n(),
           my_label = paste0(group, ' N=', n),
           my_label =  gsub(" ", "\n", my_label))
  
  if(tissue == "Blood"){
    df$compound = factor(df$compound, levels=c("THC","11-OH-THC","THCCOOH","THCCOOH-gluc")) 
  }
  
  y_pos <- df %>% 
    group_by(compound) %>% 
    summarize(y.position = mean(y_max))
  
  stat.test <- df %>%
    group_by(compound) %>%
    t_test(value ~ my_label) %>%
    adjust_pvalue(method = "bonferroni") %>%
    add_significance()
  test <- stat.test %>%
    left_join(y_pos) %>%
    mutate(p.adj.signif = ifelse(p.adj.signif=='?', 'ns', p.adj.signif),
           p.adj = ifelse(p.adj < 0.001, "<0.001", p.adj))

  if(legend){
    leg_position = 'right'
  }else{
    leg_position = 'none'
  }
  
  if(y_lab){
    y_text = "Concentration (ng/mL)"
  }else{
    y_text = ''
  }
  
  medianFunction <- function(x){
    return(data.frame(y=round(median(x),1),label=round(median(x,na.rm=T),1)))}
  
  
  p2 <- ggplot(df, aes(x=my_label, y=value, fill=my_label)) + 
    geom_jitter(position=position_jitter(width=.3, height=0), size = 0.8, color="gray65")  +
    geom_boxplot(outlier.shape=NA, alpha=0.6) +
    stat_summary(fun = "median", geom = "point", shape = 19, size = 3, fill = "black") + 
    stat_summary(fun.data = medianFunction, geom ="text", color = "black", size = 3.5, vjust = -0.65) +
    facet_wrap(~compound, scales="free_y", ncol=4) +
    geom_blank(aes(y = y_max)) + 
    scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) +
    # scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
    scale_fill_manual(values=c("#19831C", "#A27FC9")) +
    theme_classic() +
    theme(text = element_text(size=14),
          legend.position=leg_position,
          legend.title=element_blank(),
          panel.grid = element_blank(),
          strip.background = element_blank()) +
    labs(title = tissue,
         x = '',
         y = y_text) 
  
  ann_text <- test %>% 
    select(compound, p.adj, value = y.position, my_label=group1) %>%
    filter(p.adj < 0.05) %>% 
    mutate(x1 = 1, x2 = 2)
  
  if(tissue == "Whole Blood"){
    ann_text$compound = factor(ann_text$compound, 
                               levels=c("THC","11-OH-THC","THCCOOH","THCCOOH-gluc")) 
  }
  
  
  p2 + geom_text(data = ann_text, label = ann_text$p.adj, nudge_x = 0.5) +
    geom_segment(data = ann_text, aes(x = x1, xend = x2, 
                                      y = value - (0.04 * value), yend = value - (0.04*value)))
  
}
compound_boxplot_group_only(WB_dups, compounds=unlist(compounds_WB), tissue="Whole Blood")

compound_boxplot_group_only(OF_dups, compounds=unlist(compounds_OF), tissue="Oral Fluid")

compound_boxplot_group_only(BR_dups, compounds=unlist(compounds_BR), tissue="Breath")

Group Differences: Treatment

compound_boxplot_treatment <- function(dataset, compounds, tissue){
  timepoint_to_use=levels(dataset$timepoint_use)[2]
  df <- dataset %>% 
    filter(timepoint_use == timepoint_to_use) %>%
    select(treatment, group, compounds)
  df <- df %>% 
    gather(compound, value, -treatment, -group) %>% 
    clean_gluc()
  
  df %>% 
    ggplot(aes(x=treatment, y=value, fill=group)) + 
    # geom_jitter(color="gray36") +
    geom_boxplot(outlier.size=0.1) +
    scale_fill_manual(values=c("#19831C", "#A27FC9")) +
    facet_wrap(~compound, scales="free_y", ncol=4) +
    scale_x_discrete(labels=function(x) str_wrap(x, width=11)) +
    theme_classic(base_size=10) +
    theme(legend.position="bottom",
          legend.title=element_blank(),
          panel.grid=element_blank(),
          strip.background=element_blank()) +
    labs(title=tissue,
         x="Treatment",
         y="Measurement (ng/mL)")
  
}
compound_boxplot_treatment(WB_dups, compounds=unlist(compounds_WB), tissue="Whole Blood")

compound_boxplot_treatment(OF_dups, compounds=unlist(compounds_OF), tissue="Oral Fluid")

compound_boxplot_treatment(BR_dups, compounds=unlist(compounds_BR), tissue="Breath")

Compound Summaries

T2A_plot <- function(dataset, compound, timepoint_use=2){
  timepoint_to_use=levels(factor(dataset$timepoint_use))[timepoint_use]
  if(max(dataset[,compound],na.rm=TRUE)==0){
    print(
      ggplot(subset(dataset, timepoint_use==timepoint_to_use), 
             aes_string(x="group", 
                        y=compound, 
                        fill="group")) + 
        geom_boxplot(outlier.size=0.1) +
        scale_fill_manual(values=c("#19831C", "#A27FC9")) +
        scale_x_discrete(labels=function(x) str_wrap(x, width=10)) +
        scale_y_continuous(limits=c(0,3)) +
        facet_grid(~treatment) + 
        theme_classic() +
        theme(legend.position="none",
              panel.grid=element_blank(),
              strip.background=element_blank(),
              plot.title.position="plot") +
        labs(title=paste0('Timepoint: ', 
                            levels(dataset$timepoint_use)[timepoint_use],
                            ' post-smoking'),
             x='Group',
             y=gsub('GLUC', 'gluc',gsub("_", "-", toupper(compound))))
    )}else{
      print(
        
        ggplot(subset(dataset, timepoint_use==timepoint_to_use), 
               aes_string(x="group", 
                          y=compound, 
                          fill="group")) + 
          geom_boxplot(outlier.shape=NA, alpha=0.8) +
          scale_fill_manual(values=c("#19831C", "#A27FC9")) +
          scale_x_discrete(labels=function(x) str_wrap(x, width=10)) +
          facet_grid(~treatment) + 
          theme_classic() +
          theme(legend.position="none",
                panel.grid=element_blank(),
                strip.background=element_blank(),
              plot.title.position="plot") +
          labs(title=paste0('Timepoint: ', 
                              levels(dataset$timepoint_use)[timepoint_use],
                              ' post-smoking'),
               x='Group',
               y=gsub('GLUC', 'gluc',gsub("_","-",toupper(compound))))
      )
    }
}
post_wb <- map(compounds_WB, ~ T2A_plot( 
    dataset=WB_dups, 
    compound=.x))

post_of <- map(compounds_OF, ~ T2A_plot( 
    dataset=OF_dups, 
    compound=.x))

post_br <- map(compounds_BR, ~ T2A_plot( 
    dataset=BR_dups, 
    compound=.x))

Recap

  • Can you explain/describe the plots generated in the context of these data?
  • Can you generate EDA plots of your own for these data
  • Can you undersetand/work through the more complicated code provided (even if you couldn’t have come up with it on your own)